Lab #03: Probability and Teamwork

due Friday, September 16, 4:00 PM

Goals

Getting started

Log in to GitHub to determine your team number and members for Lab 3.

Every team member should now go to the course GitHub organization and locate the team's lab 3 repository, which should have the prefix lab-03. Clone the repository by copying the URL in GitHub under the SSH tab in the Code drop-down and creating a New Project using version control in RStudio. If you have trouble, see the first lab for step-by-step instructions or ask a teammate for help. Do not edit the .Rmd file until explicitly asked to do so in the instructions.

Monkeypox

Monkeypox virus is endemic in central and west Africa, and in 2022 a significant global outbreak is occurring in non-endemic areas.

Our World in Data curates a number of interesting data resources. In this lab we use their monkeypox data repository, which is updated daily. We start by visualizing these data. (Note: I’ve hidden the code below, but you can look at it in the class organization under website/docs/slides/week-03/lab-03-prob-teams.Rmd. You can change the variable plotdate in the code to look at total cases per million population on another day, if desired, but this lab focuses on August 22, 2022.)

Hot Tip

For this lab, you might want to do some pencil and paper calculations and then turn them in. If you know LaTeX-style equation coding, you can use that in R Markdown. If not, you can easily include an external image (e.g., a picture from your phone of your paper). For example, someone in my family will be getting the following holiday gift from Snorg Tees (good advice in general!). The code to include it is below; make sure that you save or upload the picture file in the same folder as this Rmd file:

![](axolotl.png)
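If you would rather type an equation than photograph it, here is a minimal illustration of LaTeX-style coding typed directly into the text of an .Rmd file. The formula (the definition of conditional probability) is chosen only as an example and is not part of the lab template.

```latex
$$
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
$$
```

Single dollar signs, as in `$P(A)$`, put the math inline with a sentence instead of on its own line.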

Team workflow

Assign each team member a number 1 through 4 and write your number down on a piece of paper. This lab will walk you through the basics of team workflow step-by-step. If your team has just three members, use your favorite method (e.g., rock-paper-scissors) to randomly assign one member to be team member 4 as well.

Do the following exercises in order, following each step carefully.

Only one person at a time should type in the .Rmd file and push updates.

The person working should share their screen, and the others should follow along.

Team member 1: Open the lab3.Rmd file and change the author field in the YAML header to “Team Number: Member 1, Member 2, Member 3, Member 4”, using your team number (for example, Team 3) and the first and last names of all team members.

Team member 1: Run the subset-data code chunk to subset the data to August 22, select only location, new cases, and total cases (not standardized to population), and print the first 6 rows and the last 6 rows. Share the results with your team members. Then, answer the questions below. (For the rest of the assignment, we will consider only cases through August 22.)

library(tidyverse)

plotdate = "2022-08-22"

#pick off date of interest
case_day <- case_series %>%
  filter(date == plotdate) %>%
  select(location,new_cases,total_cases)

head(case_day,6) #you'll want to modify this line!
##    location new_cases total_cases
## 1 Argentina         0          72
## 2 Australia         0          89
## 3   Austria         0         218
## 4   Belgium        47         671
## 5   Bolivia         0          43
## 6    Brazil       109        3896
tail(case_day,6)
##         location new_cases total_cases
## 48      Thailand         0           5
## 49        Turkey         0           5
## 50       Uruguay         1           3
## 51 United States      1308       15357
## 52     Venezuela         0           1
## 53         World      2063       44275
  1. Run the code below to add up the total cases of countries in the data. Compare this to the world count provided in the data (you can see that by arranging the data from largest to smallest counts). Address the extent to which these values are consistent with each other.
# code to add up the total cases in the dataset not attributed to "World"
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
totals <- case_day %>%
  filter(location != "World") %>%
  adorn_totals("row")

totals
##            location new_cases total_cases
##           Argentina         0          72
##           Australia         0          89
##             Austria         0         218
##             Belgium        47         671
##             Bolivia         0          43
##              Brazil       109        3896
##              Canada        10        1179
##         Switzerland        17         416
##               Chile        63         270
##            Colombia       109         273
##                Cuba         0           1
##              Cyprus         0           4
##             Czechia         0          39
##             Germany        29        3295
##             Denmark         6         169
##  Dominican Republic         0           7
##             Ecuador         0          20
##               Spain         0        6119
##             Estonia         0           9
##              France         0        2888
##      United Kingdom       145        3346
##               Ghana         0          47
##              Greece         2          52
##           Guatemala         0           4
##              Guyana         1           1
##            Honduras         0           3
##             Croatia         0          22
##             Hungary         1          63
##             Ireland         0         126
##              Israel         5         213
##               Italy         0         689
##             Jamaica         0           4
##          Luxembourg         0          45
##             Morocco         0           1
##              Mexico       134         386
##          Montenegro         1           2
##         Netherlands         1        1092
##              Norway         0          76
##              Panama         0           7
##                Peru        60        1188
##              Poland        10         114
##         Puerto Rico         2          77
##            Portugal         0         810
##             Romania         0          34
##        Saudi Arabia         0           6
##            Slovakia         2          12
##              Sweden         0         141
##            Thailand         0           5
##              Turkey         0           5
##             Uruguay         1           3
##       United States      1308       15357
##           Venezuela         0           1
##               Total      2063       43610
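As a rough sketch of the comparison, the code below sorts a small toy subset from largest to smallest and computes the gap between the two totals. The object names case_toy, world_total, and country_total are made up for this illustration; the numbers are copied from the output above.

```r
library(dplyr)

# Toy subset: three rows copied from the August 22 output above
case_toy <- data.frame(
  location    = c("Belgium", "United States", "World"),
  new_cases   = c(47, 1308, 2063),
  total_cases = c(671, 15357, 44275),
  stringsAsFactors = FALSE
)

# Sort from largest to smallest total; the World row rises to the top
sorted_toy <- case_toy %>%
  arrange(desc(total_cases))
sorted_toy

# Gap between the reported World total and the sum over listed locations
world_total   <- 44275  # World row in the data
country_total <- 43610  # Total row produced by adorn_totals()
world_total - country_total
```

In your own answer, run arrange() on case_day itself rather than on a toy subset.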

Team member 1: When you have finished, knit to PDF, then stage, commit, and push your .Rmd and PDF to GitHub with an appropriate commit message.

All other team members: Once your team member has pushed the work, pull to get the updated documents from GitHub. Click on the .Rmd file and you should see the responses to the first two exercises. Knit the file to update your own documents.

Team member 2: It’s your turn. Answer the question below.

  1. Calculate the probability that a case is from Mexico. Then calculate the probability that a case is from the US (including Puerto Rico) and the probability that a case is from Canada. In order to do these calculations, assume the sum of the reported counts in the countries (not including the World category) is the correct number of worldwide cases as of the date of interest. (The code below shows how to calculate these proportions in R; note that Puerto Rico is reported separately from the US.) Report the probabilities to two decimal places and include a sentence describing your findings.
options(scipen=999)
case_day %>%
  filter(location != "World") %>%
  #the mutate command creates a variable that gives the proportion of total cases from each location (multiply by 100 for a percent)
  mutate(percentcase=total_cases/sum(total_cases))
##              location new_cases total_cases   percentcase
## 1           Argentina         0          72 0.00165099748
## 2           Australia         0          89 0.00204081633
## 3             Austria         0         218 0.00499885347
## 4             Belgium        47         671 0.01538637927
## 5             Bolivia         0          43 0.00098601238
## 6              Brazil       109        3896 0.08933730796
## 7              Canada        10        1179 0.02703508370
## 8         Switzerland        17         416 0.00953909654
## 9               Chile        63         270 0.00619124054
## 10           Colombia       109         273 0.00626003210
## 11               Cuba         0           1 0.00002293052
## 12             Cyprus         0           4 0.00009172208
## 13            Czechia         0          39 0.00089429030
## 14            Germany        29        3295 0.07555606512
## 15            Denmark         6         169 0.00387525797
## 16 Dominican Republic         0           7 0.00016051364
## 17            Ecuador         0          20 0.00045861041
## 18              Spain         0        6119 0.14031185508
## 19            Estonia         0           9 0.00020637468
## 20             France         0        2888 0.06622334327
## 21     United Kingdom       145        3346 0.07672552167
## 22              Ghana         0          47 0.00107773446
## 23             Greece         2          52 0.00119238707
## 24          Guatemala         0           4 0.00009172208
## 25             Guyana         1           1 0.00002293052
## 26           Honduras         0           3 0.00006879156
## 27            Croatia         0          22 0.00050447145
## 28            Hungary         1          63 0.00144462279
## 29            Ireland         0         126 0.00288924559
## 30             Israel         5         213 0.00488420087
## 31              Italy         0         689 0.01579912864
## 32            Jamaica         0           4 0.00009172208
## 33         Luxembourg         0          45 0.00103187342
## 34            Morocco         0           1 0.00002293052
## 35             Mexico       134         386 0.00885118092
## 36         Montenegro         1           2 0.00004586104
## 37        Netherlands         1        1092 0.02504012841
## 38             Norway         0          76 0.00174271956
## 39             Panama         0           7 0.00016051364
## 40               Peru        60        1188 0.02724145838
## 41             Poland        10         114 0.00261407934
## 42        Puerto Rico         2          77 0.00176565008
## 43           Portugal         0         810 0.01857372162
## 44            Romania         0          34 0.00077963770
## 45       Saudi Arabia         0           6 0.00013758312
## 46           Slovakia         2          12 0.00027516625
## 47             Sweden         0         141 0.00323320339
## 48           Thailand         0           5 0.00011465260
## 49             Turkey         0           5 0.00011465260
## 50            Uruguay         1           3 0.00006879156
## 51      United States      1308       15357 0.35214400367
## 52          Venezuela         0           1 0.00002293052
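Because Puerto Rico is listed on its own row, one possible way to fold it into the US before computing probabilities is sketched below. The names case_toy, all_cases, and prob_by_country are hypothetical; the counts and the total of 43610 are copied from the output in this lab, and the total relies on the assumption stated in the exercise.

```r
library(dplyr)

# Counts copied from the August 22 output above
case_toy <- data.frame(
  location    = c("Canada", "Mexico", "Puerto Rico", "United States"),
  total_cases = c(1179, 386, 77, 15357),
  stringsAsFactors = FALSE
)
all_cases <- 43610  # sum over all listed locations, excluding World

# Recode Puerto Rico as part of the US, then sum counts within each country
prob_by_country <- case_toy %>%
  mutate(country = if_else(location == "Puerto Rico", "United States", location)) %>%
  group_by(country) %>%
  summarise(cases = sum(total_cases)) %>%
  mutate(prob = round(cases / all_cases, 2))

prob_by_country
```

Adding the two rows by hand from the table gives the same answer; the recoding approach is only a convenience.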

Team member 2: Knit to PDF, then stage, commit, and push your .Rmd and PDF to GitHub with an appropriate commit message.

All other team members: Once your team member has pushed the work, pull to get the updated documents from GitHub. Click on the .Rmd file and you should see the responses to the first three exercises. Knit the file.

Team member 3: It’s your turn. Complete the exercise below.

  1. Create a segmented bar chart, with each bar going from 0-1, with the country names along the y-axis and horizontal bars illustrating the fraction of new (as of August 22) and prior cases (before August 22) for each country. Use informative labels and titles. Which country has the highest percentage of new cases on August 22? Comment on whether this is a cause for major concern and why.

The starter code rearranges the cases to facilitate the plotting, creating a new dataset tidycase. Some plotting options are supplied to get you started, but you’ll need to fill out the code to create the plot!

# split total cases into prior ("old") and new cases
# store in same dataset
# copy new_cases into a variable named new
case_day <- case_day %>%
  mutate(old=total_cases-new_cases, new=new_cases) %>%
  select(-new_cases) #this drops the original new_cases variable

#this formats the data for better plotting
# we'll learn more about this in data wrangling notes
# we're making two observations per country - one for new and one for old cases

tidycase <- case_day %>%
  filter(location != "World") %>% #drop entire world summaries
  pivot_longer(cols=c("new","old"),
               names_to = "type",
               values_to = "count")  %>%
  select(location, type, count) %>%
  mutate(type=as.factor(type)) 

#take a peek at new dataset
head(tidycase)
## # A tibble: 6 × 3
##   location  type  count
##   <chr>     <fct> <dbl>
## 1 Argentina new       0
## 2 Argentina old      72
## 3 Australia new       0
## 4 Australia old      89
## 5 Austria   new       0
## 6 Austria   old     218
# now finally make the plot!
# this is commented out for now because it doesn't run 
# until you make edits!

#tidycase %>%
#  ggplot(aes(x = , y = , fill= )) +

# note the stat="identity" option is needed because our data
# have been summarized already (new and old cases have already
# been counted)

#  geom_bar(stat="identity",position="fill")  + 

# element text makes the font size larger or smaller; play with this

#theme(axis.text = element_text(size = 4)) +
# labs()
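To see how position="fill" behaves before you edit the starter code, here is a small sketch on a two-country toy dataset. The object name toy is hypothetical, the counts are copied from output shown in this lab, and the labels are just one possible choice rather than the expected answer.

```r
library(ggplot2)

# Two countries, with counts copied from the output shown in this lab
toy <- data.frame(
  location = c("Canada", "Canada", "Mexico", "Mexico"),
  type     = c("new", "old", "new", "old"),
  count    = c(10, 1169, 134, 252)
)

# geom_col() is shorthand for geom_bar(stat = "identity");
# position = "fill" rescales each bar so that it runs from 0 to 1
p <- ggplot(toy, aes(x = count, y = location, fill = type)) +
  geom_col(position = "fill") +
  labs(x = "Fraction of cases", y = "Country", fill = "Case type",
       title = "New and prior cases as of August 22, 2022")

p
```

Putting the count on x and the country on y is what makes the bars horizontal.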

Team member 3: Knit to PDF, then stage, commit, and push your .Rmd and PDF to GitHub with an appropriate commit message.

All other team members: Once your team member has pushed the work, pull to get the updated documents from GitHub. Click on the .Rmd file and you should see the responses to the first four exercises. Knit the file.

Team member 4: It’s your turn. Complete the exercise below.

  1. What is the conditional probability that a North American monkeypox case is from Canada? Calculate this conditional probability along with the corresponding probabilities that a North American case is from the US (including Puerto Rico) and that a North American case is from Mexico. The code below gives each location's share of North American cases as a proportion.
case_day %>%
  #these are the North American countries with cases up to August 22
  filter(location %in% c("United States","Canada","Puerto Rico","Panama","Dominican Republic","Guatemala","Mexico")) %>%
  # this code calculates percent of cases from each location from the North American countries in the filter statement above
  mutate(percent=total_cases/sum(total_cases))
##             location total_cases   old  new      percent
## 1             Canada        1179  1169   10 0.0692836575
## 2 Dominican Republic           7     7    0 0.0004113534
## 3          Guatemala           4     4    0 0.0002350591
## 4             Mexico         386   252  134 0.0226831992
## 5             Panama           7     7    0 0.0004113534
## 6        Puerto Rico          77    75    2 0.0045248869
## 7      United States       15357 14049 1308 0.9024504907
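As a sketch of the reasoning behind the Canada row, write NA for the event that a case comes from one of the seven North American locations in the filter. The seven counts sum to 17017, and 43610 is the worldwide total under the assumption used earlier in the lab.

```latex
$$
P(\text{Canada} \mid \text{NA})
  = \frac{P(\text{Canada} \cap \text{NA})}{P(\text{NA})}
  = \frac{1179 / 43610}{17017 / 43610}
  = \frac{1179}{17017}
  \approx 0.069
$$
```

The worldwide total cancels, so the conditional probability depends only on the North American counts.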

Team member 4: Knit to PDF, then stage, commit, and push your .Rmd and PDF to GitHub with an appropriate commit message.

All other team members: Once your team member has pushed the work, pull to get the updated documents from GitHub. Click on the .Rmd file and you should see the responses to the first five exercises. Knit the file.

Team member 1: It’s your turn again. Answer the question below with help from your team.

  1. You might wonder whether the probability that someone is a monkeypox case is related to their country of residence. Another way to think of this is to explore whether country of residence and monkeypox infection status are independent. Let’s explore this only in North American countries that have reported at least one monkeypox case, given that monkeypox is not endemic there (we could just as easily have chosen another non-endemic continent). While ideally we would include all North American countries, for this example we limit to the ones for which monkeypox case rates are non-zero, as otherwise we would need an external source to get population size.

That is, we will get a very rough estimate of each country’s population as the number of cases divided by the cases-per-million rate, multiplied by one million.

Let \(A\) be the event a person in this set of data is a US resident, and let \(B\) be the event a person has monkeypox. If country of residence and infection status are independent in these North American countries, then \(P(A|B) = P(A)\). If this condition is satisfied, then we’d want to check the condition for other countries in North America to be sure country and infection are independent. If the condition is not satisfied for the US, then the two variables are not independent, and we don’t have to bother checking other countries.

You may find the following code helpful!

# creates new dataset, data5, that contains total cases as of August 22, and population of each country, among the North American countries with cases
data5 <- case_series %>%
    filter(date == plotdate) %>%
    filter(location %in% c("United States","Canada","Puerto Rico","Panama","Dominican Republic","Guatemala","Mexico")) %>%
  mutate(approxpop = 1000000*total_cases/total_cases_per_million) %>%
  select(location,approxpop,total_cases)

data5
##             location approxpop total_cases
## 1             Canada  38155340        1179
## 2 Dominican Republic  11111111           7
## 3          Guatemala  17621145           4
## 4             Mexico 126723572         386
## 5             Panama   4350528           7
## 6        Puerto Rico   3256089          77
## 7      United States 336998025       15357
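One possible way to compute the two probabilities in the independence check is sketched below in base R. The names data5_toy, is_us, p_a, and p_a_given_b are hypothetical, the values are copied from the data5 output above, and the sketch assumes that event A covers only the row labeled United States (Puerto Rico is kept separate).

```r
# Values copied from the data5 output above
data5_toy <- data.frame(
  location    = c("Canada", "Dominican Republic", "Guatemala", "Mexico",
                  "Panama", "Puerto Rico", "United States"),
  approxpop   = c(38155340, 11111111, 17621145, 126723572,
                  4350528, 3256089, 336998025),
  total_cases = c(1179, 7, 4, 386, 7, 77, 15357),
  stringsAsFactors = FALSE
)

is_us <- data5_toy$location == "United States"

# P(A): share of the combined approximate population living in the US
p_a <- data5_toy$approxpop[is_us] / sum(data5_toy$approxpop)

# P(A | B): share of the combined cases reported in the US
p_a_given_b <- data5_toy$total_cases[is_us] / sum(data5_toy$total_cases)

round(c(p_a = p_a, p_a_given_b = p_a_given_b), 3)
```

Comparing the two printed values is the check described in the exercise; deciding what the comparison says about independence is left to your team.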

Team member 1: When you have finished, knit to PDF, then stage, commit, and push your .Rmd and PDF to GitHub with an appropriate commit message.

All other team members: Once your team member has pushed the work, pull to get the updated documents from GitHub. Click on the .Rmd file and you should see the responses to the first six exercises. Knit the file to update your own documents.

Team member 2: It’s your turn. Answer the question below.

  1. Now let’s go back to the original data, case_series. The variable new_cases shows the number of new cases reported each day of the outbreak. Filter the data to include the country of your choice and create a scatterplot showing the case trend over time in this country. Describe this trend in a couple of sentences.

Because the variable date in that dataset is stored as a character rather than a date variable, we first need to reformat it using the code below.

# change formatting of date from character to date format
case_series <- case_series %>% #make a change and save to dataset of same name
  mutate(date=as.Date(date,'%Y-%m-%d'))

# ggplot tip: this code below angles the x axis tick labels
# helpful if the labels overlap each other and are hard to read
# + theme(axis.text.x=element_text(angle=60, hjust=1)) 
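To show how the date conversion and the angled tick labels fit together, here is a small sketch on a three-day toy series. The object toy_series and its counts are made up purely for illustration and are not taken from case_series.

```r
library(ggplot2)

# Hypothetical toy series; these counts are invented for illustration only
toy_series <- data.frame(
  date      = c("2022-08-20", "2022-08-21", "2022-08-22"),
  new_cases = c(3, 0, 5),
  stringsAsFactors = FALSE
)

# Convert the character column to Date, as in the code above
toy_series$date <- as.Date(toy_series$date, "%Y-%m-%d")
class(toy_series$date)

# Scatterplot of new cases over time with angled x axis tick labels
p <- ggplot(toy_series, aes(x = date, y = new_cases)) +
  geom_point() +
  labs(x = "Date", y = "New cases reported") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

p
```

For the exercise itself, filter case_series to your chosen country first and use that filtered data in place of toy_series.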

Team member 2: When you have finished, knit to PDF, then stage, commit, and push your .Rmd and PDF to GitHub with an appropriate commit message.

All other team members: Once your team member has pushed the work, pull to get the updated documents from GitHub. Click on the .Rmd file to see your final version of the lab.

Team member 3: Upload your team’s PDF to Gradescope. Include every team member’s name in the Gradescope submission and identify which problems are on each page in Gradescope. Associate the “Overall” section with the first page of your PDF.

There should only be one submission per team on Gradescope.

Grading

Total: 50 pts